Statistical Language Models of Lithuanian Based on Word Clustering and Morphological Decomposition
Abstract
This paper describes our research on statistical language modeling of Lithuanian. We investigated whether sparse n-gram models of the highly inflected Lithuanian language can be improved by interpolating them with more complex n-gram models based on word clustering and morphological word decomposition. Words, word base forms, and part-of-speech tags were clustered into 50 to 5000 automatically generated classes. Multiple 3-gram and 4-gram class-based language models were built and evaluated on a Lithuanian text corpus of 85 million words. Linearly interpolating class-based models with the 3-gram model reduced perplexity by up to 13% relative to the baseline 3-gram model. Morphological models decreased the out-of-vocabulary word rate from 1.5% to 1.02%.
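The quantities named in the abstract (linear interpolation, perplexity, and out-of-vocabulary rate) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the interpolation weight of 0.6, and all probabilities are made-up assumptions chosen for illustration. The class-based probability uses the common factorisation P(w | h) = P(w | c(w)) · P(c(w) | class history), which the abstract does not spell out.

```python
import math


def class_based_prob(p_word_given_class, p_class_given_history):
    """Common class-based factorisation (an assumption, not taken from the paper):
    P(w | h) = P(w | c(w)) * P(c(w) | class history)."""
    return p_word_given_class * p_class_given_history


def interpolate(p_word_model, p_class_model, lam):
    """Linear interpolation of two model probabilities for the same word."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_word_model + (1.0 - lam) * p_class_model


def perplexity(probs):
    """Perplexity of a test sequence given the probability assigned to each word."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))


def oov_rate(tokens, vocabulary):
    """Fraction of running tokens that are not in the vocabulary."""
    return sum(1 for t in tokens if t not in vocabulary) / len(tokens)


# Hypothetical per-word probabilities for a four-word test sequence.
word_probs = [0.10, 0.001, 0.05, 0.0005]    # sparse word 3-gram model
class_probs = [0.08, 0.010, 0.04, 0.0040]   # class-based model

mixed = [interpolate(pw, pc, 0.6) for pw, pc in zip(word_probs, class_probs)]

baseline_ppl = perplexity(word_probs)
mixed_ppl = perplexity(mixed)
reduction = 1.0 - mixed_ppl / baseline_ppl  # relative perplexity reduction

print(f"baseline perplexity:     {baseline_ppl:.1f}")
print(f"interpolated perplexity: {mixed_ppl:.1f}")
print(f"relative reduction:      {reduction:.1%}")
```

In this toy data the class-based model assigns more probability to the rare words that the word model nearly misses, which is why the interpolated perplexity comes out lower. In practice the weight would be tuned on held-out text rather than fixed by hand.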
Similar resources
Morphological Annotation of the Lithuanian Corpus
As information technologies develop, large morphologically annotated corpora become a necessity for moving on to higher levels of language computerisation (e.g., automatic syntactic and semantic analysis, information extraction, machine translation). Research of morphological disambiguation and morphological annotation of the 100 million word Lith...
Review of statistical modeling of h using very large v
This paper presents state-of-the-art language modeling (LM) of Lithuanian, a highly inflected, free-word-order language. Perplexities and word error rates (WER) of standard n-gram, class-based, cache-based, topic mixture and morphological LMs were estimated and compared for a vocabulary of more than 1 million words. WER estimates were obtained by solving a speaker-dependent ASR task whe...
Cache-based Statistical Language Models of English and Highly Inflected Lithuanian
This paper investigates a variety of statistical cache-based language models built upon three corpora: English, Lithuanian, and Lithuanian base forms. The impact of the cache size, type of the decay function, including custom corpus derived functions, and interpolation technique (static vs. dynamic) on the perplexity of a language model is studied. The best results are achieved by models consis...
On efficient training of word classes and their application to recurrent neural network language models
In this paper, we investigated various word clustering methods by studying two clustering algorithms, Brown clustering and the exchange algorithm, and three objective functions derived from different class-based language models (CBLM): two-sided, predictive and conditional models. In particular, we focused on the implementation of the exchange algorithm with improved speed. In total, we compared s...
Word clustering effect on vocabulary learning of EFL learners: A case of semantic versus phonological clustering
The aim of this study is to determine the effect of word clustering method on vocabulary learning of Iranian EFL learners through a case of semantic versus phonological clustering. To this effect, 80 homogeneous students from four intermediate classes at an English institute in Torbat e Heydariyeh participated in this research. They were assigned to four groups according to semantic versus phon...
Journal title:
- Informatica, Lith. Acad. Sci.
Volume 15
Publication date: 2004